Week 09
Logistic Regression

SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research


Semester 1, 2026
Last updated: 2026-01-23

Francesco Bailo

Acknowledgement of Country

I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.

Learning Objectives

By the end of this week, you will be able to:

  • Fit and interpret logistic regression models
  • Understand and interpret odds ratios and log-odds
  • Make probabilistic predictions for binary outcomes
  • Evaluate classification model performance
  • Use bootstrap for uncertainty estimation

Readings for This Week

TSwD

  • Ch 13.1: Introduction to GLMs
  • Ch 13.2: Logistic regression

ROS

  • Ch 13: Logistic regression
  • Ch 14: Working with logistic regression
  • Ch 5.4: Bootstrapping

Why Logistic Regression?

The Problem with Linear Regression for Binary Outcomes

Linear regression assumes a continuous outcome variable that can take any value on the real line.

But what if our outcome is binary?

  • Did they vote? (yes/no)
  • Did the bill pass? (yes/no)
  • Did the respondent support the policy? (support/oppose)

The Problem

Linear regression can predict probabilities outside the range [0, 1]!
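A quick sketch of the issue with simulated data (variable names are illustrative): fitting an ordinary linear model to a 0/1 outcome happily extrapolates past both bounds.

```r
# A linear model fit to a binary outcome (a "linear probability model")
set.seed(853)
x <- runif(100, min = 0, max = 100)
y <- rbinom(100, size = 1, prob = plogis((x - 50) / 10))

lpm <- lm(y ~ x)

# Predictions at extreme x values escape the [0, 1] interval
predict(lpm, newdata = data.frame(x = c(-50, 50, 150)))
```

The fitted line crosses 0 and 1, so "probabilities" below 0 and above 1 are inevitable once \(x\) is extreme enough.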

Visualising the Problem

The Bernoulli Distribution

Binary outcomes follow the Bernoulli distribution:

  • Outcome \(y\) is either 0 or 1
  • Probability \(p\) of outcome “1”
  • Probability \(1-p\) of outcome “0”
# Simulating from a Bernoulli distribution
set.seed(853)
bernoulli_draws <- rbinom(n = 20, size = 1, prob = 0.3)
bernoulli_draws
 [1] 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0

The Logit Function

From Probabilities to the Real Line

The logit function transforms probabilities (bounded between 0 and 1) to values on the entire real line:

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\]

Examples:

  • \(\text{logit}(0.1) = -2.2\)
  • \(\text{logit}(0.5) = 0\)
  • \(\text{logit}(0.9) = 2.2\)

The ratio \(\frac{p}{1-p}\) is called the odds.
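For instance, \(p = 0.75\) gives odds of \(0.75 / 0.25 = 3\) ("three to one"). A small sketch confirming that the log of the odds matches R's built-in qlogis():

```r
# Odds and log-odds for a few probabilities
p <- c(0.1, 0.5, 0.75, 0.9)

p / (1 - p)        # odds
log(p / (1 - p))   # log-odds, computed by hand
qlogis(p)          # log-odds, built-in
```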

The Logit Function Visualised

The Inverse Logit Function

The inverse logit (or logistic) function converts values on the real line back to probabilities:

\[\text{logit}^{-1}(x) = \frac{e^x}{1 + e^x}\]

In R, we can use:

# logit function
qlogis(0.5)
[1] 0
# inverse logit function  
plogis(0)
[1] 0.5

The Inverse Logit Function Visualised

The Logistic Regression Model

Model Specification

The logistic regression model predicts the probability that \(y = 1\):

\[\Pr(y_i = 1) = \text{logit}^{-1}(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots)\]

Equivalently:

\[\text{logit}(\Pr(y_i = 1)) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots\]

Key Insight

The coefficients \(\beta\) describe relationships on the log-odds scale, not the probability scale.

Fitting Logistic Regression in R

We use glm() with family = binomial:

# Basic syntax
model <- glm(outcome ~ predictor1 + predictor2, 
             data = my_data,
             family = binomial(link = "logit"))

Or with rstanarm for Bayesian estimation:

# Bayesian logistic regression
model <- stan_glm(outcome ~ predictor1 + predictor2,
                  data = my_data,
                  family = binomial(link = "logit"))

Example: Weekday or Weekend?

Simulated Example

Let’s simulate data on whether it’s a weekday based on the number of cars on the road:

set.seed(853)
traffic_data <- tibble(
  num_cars = sample.int(n = 100, size = 500, replace = TRUE),
  noise = rnorm(n = 500, mean = 0, sd = 10),
  is_weekday = if_else(num_cars + noise > 50, 1, 0)
) |>
  select(-noise)

head(traffic_data)
# A tibble: 6 × 2
  num_cars is_weekday
     <int>      <dbl>
1        9          0
2       64          1
3       90          1
4       93          1
5       17          0
6       29          0

Fitting the Model

traffic_model <- glm(
  is_weekday ~ num_cars,
  data = traffic_data,
  family = binomial(link = "logit")
)

summary(traffic_model)

Call:
glm(formula = is_weekday ~ num_cars, family = binomial(link = "logit"), 
    data = traffic_data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -10.05861    1.19848  -8.393   <2e-16 ***
num_cars      0.20037    0.02311   8.669   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 689.94  on 499  degrees of freedom
Residual deviance: 152.98  on 498  degrees of freedom
AIC: 156.98

Number of Fisher Scoring iterations: 8

Visualising the Fit

traffic_data |>
  mutate(predicted_prob = predict(traffic_model, type = "response")) |>
  ggplot(aes(x = num_cars)) +
  geom_point(aes(y = is_weekday), alpha = 0.3) +
  geom_line(aes(y = predicted_prob), colour = "#1791e8", linewidth = 1.2) +
  labs(x = "Number of cars", y = "Probability (weekday)",
       title = "Logistic regression fit") +
  theme_minimal(base_size = 16)

Interpreting Coefficients

The Challenge of Interpretation

Non-linearity

Logistic regression coefficients are on the log-odds scale. A unit change in \(x\) does not correspond to a constant change in probability.

The same coefficient produces different probability changes depending on where you are on the curve:

  • Near \(p = 0.5\): largest effect
  • Near \(p = 0\) or \(p = 1\): smallest effect
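Using round numbers close to the traffic model's estimates above (hypothetical coefficients, not the fitted values), the same one-unit change in the predictor moves the probability very differently at the centre versus the tail:

```r
# Hypothetical coefficients, close to the traffic model's estimates
beta0 <- -10
beta1 <- 0.2

# One extra car near the centre of the curve (p = 0.5)
plogis(beta0 + beta1 * 51) - plogis(beta0 + beta1 * 50)  # about 0.05

# One extra car out in the upper tail (p close to 1)
plogis(beta0 + beta1 * 81) - plogis(beta0 + beta1 * 80)  # about 0.0004
```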

The Divide-by-4 Rule

Quick Interpretation

Divide the logistic regression coefficient by 4 to get the maximum possible effect on probability.

The logistic curve is steepest at its centre (where \(p = 0.5\)), with maximum slope = \(\beta/4\).

Example: If \(\beta = 0.19\), then:

\[\frac{0.19}{4} = 0.0475 \approx 5\%\]

A one-unit increase in \(x\) corresponds to at most a 5 percentage point increase in the probability.
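The rule can be checked numerically: the slope of the inverse logit at its centre is exactly \(\beta/4\). A sketch with the hypothetical \(\beta = 0.19\) from the example:

```r
beta <- 0.19

# Numerical slope of the logistic curve at its centre (log-odds = 0),
# where the curve is steepest
eps <- 1e-6
slope <- (plogis(beta * eps) - plogis(-beta * eps)) / (2 * eps)

slope     # approximately 0.0475
beta / 4  # 0.0475
```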

Interpretation at Specific Values

For more precise interpretation, evaluate the effect at a specific point:

coefs <- coef(traffic_model)
coefs
(Intercept)    num_cars 
-10.0586139   0.2003724 
# Probability at 30 cars
plogis(coefs[1] + coefs[2] * 30)
(Intercept) 
 0.01716716 
# Probability at 50 cars  
plogis(coefs[1] + coefs[2] * 50)
(Intercept) 
  0.4900034 
# Probability at 70 cars
plogis(coefs[1] + coefs[2] * 70)
(Intercept) 
  0.9814299 

Odds Ratios

An alternative interpretation uses odds ratios. Exponentiating the coefficient gives the multiplicative change in odds:

# Coefficient on log-odds scale
coefs[2]
 num_cars 
0.2003724 
# Odds ratio
exp(coefs[2])
num_cars 
1.221858 

Interpretation: For each additional car, the odds of it being a weekday are multiplied by 1.22.

Summary of Coefficient Interpretation

Method                    Result                          Interpretation
Raw coefficient           \(\beta\)                       Change in log-odds per unit \(x\)
Divide by 4               \(\beta/4\)                     Maximum probability change
Odds ratio                \(e^\beta\)                     Multiplicative change in odds
Evaluate at \(\bar{x}\)   \(\text{logit}^{-1}(\ldots)\)   Probability at specific values

Example: Voting Behaviour

Political Preference and Income

Let’s examine a classic social science question: Does income predict voting for conservative parties?

# Simulate voting data (mimicking NES data structure)
set.seed(42)
n <- 1000
voting_data <- tibble(
  income = sample(1:5, n, replace = TRUE),  # 1=poor, 5=rich
  vote_conservative = rbinom(n, 1, prob = plogis(-1.4 + 0.33 * income))
)

Fitting the Voting Model

vote_model <- glm(vote_conservative ~ income,
                  data = voting_data,
                  family = binomial(link = "logit"))

summary(vote_model)

Call:
glm(formula = vote_conservative ~ income, family = binomial(link = "logit"), 
    data = voting_data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.38827    0.15950  -8.704  < 2e-16 ***
income       0.33935    0.04797   7.074 1.51e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1350.7  on 999  degrees of freedom
Residual deviance: 1298.4  on 998  degrees of freedom
AIC: 1302.4

Number of Fisher Scoring iterations: 4

Visualising Voting Probabilities

Interpreting the Voting Model

coefs_vote <- coef(vote_model)

# Divide by 4 rule
coefs_vote[2] / 4
    income 
0.08483779 
# Probability at each income level
tibble(income = 1:5) |>
  mutate(probability = plogis(coefs_vote[1] + coefs_vote[2] * income))
# A tibble: 5 × 2
  income probability
   <int>       <dbl>
1      1       0.259
2      2       0.330
3      3       0.408
4      4       0.492
5      5       0.577

Each unit increase in income corresponds to at most an ~8 percentage point increase in the probability of voting conservative.

Making Predictions

Point Predictions

Use predict() with type = "response" for probabilities:

# Predict for new observations
new_data <- tibble(income = c(1, 3, 5))

predict(vote_model, newdata = new_data, type = "response")
        1         2         3 
0.2594333 0.4084894 0.5765163 

Without type = "response", you get predictions on the log-odds scale:

predict(vote_model, newdata = new_data, type = "link")
         1          2          3 
-1.0489160 -0.3702137  0.3084886 

Predictions with Uncertainty

Using marginaleffects::predictions():

library(marginaleffects)

predictions(vote_model, newdata = new_data) |>
  as_tibble() |>
  select(income, estimate, conf.low, conf.high)
# A tibble: 3 × 4
  income estimate conf.low conf.high
   <dbl>    <dbl>    <dbl>     <dbl>
1      1    0.259    0.218     0.306
2      3    0.408    0.378     0.440
3      5    0.577    0.521     0.631

Marginal Effects

The marginal effect tells us how the probability changes for a small change in \(x\):

# Marginal effect at each income level
slopes(vote_model, newdata = new_data, variables = "income") |>
  select(income, estimate, std.error)

 Estimate Std. Error
   0.0652    0.00651
   0.0820    0.01161
   0.0829    0.01054

Note

Notice the marginal effect varies across income levels due to the non-linearity of the logistic function.

Model Evaluation

Residuals for Logistic Regression

Residuals for logistic regression are defined as:

\[\text{residual}_i = y_i - \hat{p}_i\]

Because \(y\) is either 0 or 1 while \(\hat{p}\) is continuous, each residual can take only one of two values for a given \(\hat{p}\), which makes raw residual plots hard to read:

  • If \(\hat{p} = 0.7\) and \(y = 1\): residual = \(0.3\)
  • If \(\hat{p} = 0.7\) and \(y = 0\): residual = \(-0.7\)

Binned Residuals

Solution: Binned Residuals

Divide observations into bins based on predicted probability, then plot the average residual in each bin.
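A minimal base-R sketch of the idea, refitting the traffic model (regenerated with the same seed so the chunk stands alone); the choice of 10 bins is arbitrary:

```r
# Regenerate the traffic data and refit the model
set.seed(853)
num_cars <- sample.int(n = 100, size = 500, replace = TRUE)
noise <- rnorm(n = 500, mean = 0, sd = 10)
is_weekday <- ifelse(num_cars + noise > 50, 1, 0)
model <- glm(is_weekday ~ num_cars, family = binomial)

# Bin observations by predicted probability and average the residuals
df <- data.frame(p_hat = predict(model, type = "response"))
df$resid <- is_weekday - df$p_hat
df$bin <- cut(rank(df$p_hat, ties.method = "first"),
              breaks = 10, labels = FALSE)
binned <- aggregate(df[, c("p_hat", "resid")],
                    by = list(bin = df$bin), FUN = mean)

# If the model fits well, average residuals hover around zero in every bin
plot(binned$p_hat, binned$resid,
     xlab = "Average predicted probability",
     ylab = "Average residual")
abline(h = 0, lty = 2)
```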

Classification Accuracy

We can evaluate predictions by classifying at a threshold (typically 0.5):

voting_data <- voting_data |>
  mutate(
    predicted_prob = predict(vote_model, type = "response"),
    predicted_class = if_else(predicted_prob > 0.5, 1, 0)
  )

# Confusion matrix
table(Predicted = voting_data$predicted_class, 
      Actual = voting_data$vote_conservative)
         Actual
Predicted   0   1
        0 506 310
        1  88  96

Classification Metrics

# Calculate accuracy
accuracy <- mean(voting_data$predicted_class == voting_data$vote_conservative)
accuracy
[1] 0.602
# Null model accuracy (always predict most common class)
null_accuracy <- max(mean(voting_data$vote_conservative), 
                     1 - mean(voting_data$vote_conservative))
null_accuracy
[1] 0.594

Compare to Null Model

Always compare your model’s accuracy to the null model (predicting the most common outcome for everyone).

Log Score

A better evaluation metric is the log score (log-likelihood):

# Log score
log_score <- sum(
  voting_data$vote_conservative * log(voting_data$predicted_prob) +
  (1 - voting_data$vote_conservative) * log(1 - voting_data$predicted_prob)
)
log_score
[1] -649.1941
# Log score per observation
log_score / nrow(voting_data)
[1] -0.6491941

Note

Higher (less negative) log scores indicate better fit. The log score properly accounts for the probability predictions, not just the classifications.
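The hand-computed log score is just the model's log-likelihood, so it can be cross-checked against R's logLik(). A self-contained sketch, regenerating comparable voting data in base R:

```r
# Regenerate the voting data (base R, same seed as earlier)
set.seed(42)
n <- 1000
income <- sample(1:5, n, replace = TRUE)
vote <- rbinom(n, 1, prob = plogis(-1.4 + 0.33 * income))
m <- glm(vote ~ income, family = binomial)

p_hat <- predict(m, type = "response")
log_score <- sum(vote * log(p_hat) + (1 - vote) * log(1 - p_hat))

log_score
as.numeric(logLik(m))  # identical, up to floating-point error
```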

Bootstrapping for Uncertainty

What is Bootstrapping?

Bootstrapping is a resampling technique to estimate uncertainty:

  1. Resample the data with replacement
  2. Fit the model on the resampled data
  3. Repeat many times
  4. The distribution of estimates approximates the sampling distribution

Bootstrap Example: Gender Wage Gap

# Simulate earnings data
set.seed(42)
earnings_data <- tibble(
  earn = rlnorm(200, meanlog = 10, sdlog = 0.5),
  male = rbinom(200, 1, 0.5)
) |>
  mutate(earn = if_else(male == 1, earn * 1.2, earn))

# Observed ratio of median earnings
observed_ratio <- median(earnings_data$earn[earnings_data$male == 0]) /
                  median(earnings_data$earn[earnings_data$male == 1])
observed_ratio
[1] 0.76043

Implementing the Bootstrap

# Bootstrap function
boot_ratio <- function(data) {
  n <- nrow(data)
  boot_indices <- sample(n, replace = TRUE)
  boot_data <- data[boot_indices, ]
  median(boot_data$earn[boot_data$male == 0]) /
    median(boot_data$earn[boot_data$male == 1])
}

# Run bootstrap
n_sims <- 1000
boot_results <- replicate(n_sims, boot_ratio(earnings_data))

# Bootstrap standard error
sd(boot_results)
[1] 0.05569499

Bootstrap Distribution

# 95% bootstrap confidence interval
quantile(boot_results, c(0.025, 0.975))
     2.5%     97.5% 
0.6656726 0.8938205 

Logistic Regression with Multiple Predictors

Adding Predictors

Extending the voting model with gender:

# Add gender to our data
set.seed(42)
voting_data <- voting_data |>
  mutate(
    female = rbinom(n(), 1, 0.5),
    vote_conservative = rbinom(n(), 1, 
      prob = plogis(-1.4 + 0.33 * income - 0.4 * female))
  )

vote_model2 <- glm(vote_conservative ~ income + female,
                   data = voting_data,
                   family = binomial(link = "logit"))
summary(vote_model2)

Call:
glm(formula = vote_conservative ~ income + female, family = binomial(link = "logit"), 
    data = voting_data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.23201    0.17285  -7.128 1.02e-12 ***
income       0.32146    0.04878   6.590 4.41e-11 ***
female      -0.50390    0.13551  -3.719    2e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1325.2  on 999  degrees of freedom
Residual deviance: 1265.2  on 997  degrees of freedom
AIC: 1271.2

Number of Fisher Scoring iterations: 4

Interpreting Multiple Predictors

coefs2 <- coef(vote_model2)

# Divide by 4 for quick interpretation
coefs2 / 4
(Intercept)      income      female 
-0.30800219  0.08036449 -0.12597400 
# Probability for poor female vs rich male
prob_poor_female <- plogis(coefs2[1] + coefs2[2] * 1 + coefs2[3] * 1)
prob_rich_male <- plogis(coefs2[1] + coefs2[2] * 5 + coefs2[3] * 0)

c(poor_female = prob_poor_female, rich_male = prob_rich_male)
poor_female.(Intercept)   rich_male.(Intercept) 
              0.1955336               0.5927344 

Visualising Multiple Predictors

Interactions in Logistic Regression

Why Interactions?

An interaction allows the effect of one predictor to depend on the value of another:

vote_model3 <- glm(
  vote_conservative ~ income * female,
  data = voting_data,
  family = binomial(link = "logit")
)

summary(vote_model3)

Call:
glm(formula = vote_conservative ~ income * female, family = binomial(link = "logit"), 
    data = voting_data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -1.25555    0.22118  -5.677 1.37e-08 ***
income         0.32924    0.06674   4.933 8.08e-07 ***
female        -0.45260    0.32867  -1.377    0.168    
income:female -0.01675    0.09780  -0.171    0.864    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1325.2  on 999  degrees of freedom
Residual deviance: 1265.2  on 996  degrees of freedom
AIC: 1273.2

Number of Fisher Scoring iterations: 4

Centering for Interpretability

Best Practice

Centre continuous predictors before including interactions to make coefficients more interpretable.

voting_data <- voting_data |>
  mutate(income_c = income - mean(income))

vote_model3c <- glm(
  vote_conservative ~ income_c * female,
  data = voting_data,
  family = binomial(link = "logit")
)

coef(vote_model3c)
    (Intercept)        income_c          female income_c:female 
    -0.29746851      0.32923891     -0.50132906     -0.01674654 

Now the female coefficient gives the estimated gender difference at the mean income, rather than at the uninformative value income = 0.

Summary

Key Takeaways

Logistic Regression

  • For binary outcomes
  • Predicts probabilities bounded [0,1]
  • Coefficients on log-odds scale
  • Use glm() with family = binomial

Interpretation

  • Divide by 4 for max probability effect
  • Exponentiate for odds ratios
  • Evaluate at specific values
  • Marginal effects vary across \(x\)

Key Takeaways (continued)

Model Evaluation

  • Binned residual plots
  • Compare accuracy to null model
  • Log score for probabilistic predictions
  • Cross-validation (LOO, k-fold)

Uncertainty

  • Standard errors for coefficients
  • Confidence intervals for predictions
  • Bootstrap for complex quantities

Next Week

Week 10: Count Models and Multilevel Modelling

  • Poisson and negative binomial regression
  • Overdispersion
  • Introduction to multilevel models

Readings

  • TSwD Ch 13.3-13.5
  • ROS Ch 15
